The “crow’s foot” that connects the tables together shows us how the columns in one table relate to the columns in another table. ## Types Of Statements A few statements include: * CREATE TABLE is a statement that creates a new table in a database. * DROP TABLE is a statement that removes a table in a database. * SELECT allows you to read data and display it. This is called a query.
Here you were introduced to two statements that will be used in every query you write * SELECT - what columns you want back. * FROM - what table you are querying from.
SELECT id, account_id, occurred_at
FROM orders;
The LIMIT statement is useful when you want to see just the first few rows of a table. This can be much faster for loading than if we load the entire dataset. The LIMIT command is always the very last part of a query. An example of showing just the first 10 rows of the orders table with all of the columns might look like the following:
SELECT *
FROM orders
LIMIT 10
SELECT id, occurred_at, total_amt_usd
FROM orders
ORDER BY occurred_at
LIMIT 10;
We can ORDER BY more than one column at a time. The statement sorts according to columns listed from left first and those listed on the right after that.
SELECT *
FROM orders
ORDER BY occurred_at, total_amt_usd
LIMIT 10;
Using the WHERE statement, we can subset out tables based on conditions that must be met.
SELECT *
FROM orders
WHERE gloss_amt_usd >= 1000
LIMIT 5;
The WHERE statement can also be used with non-numerical data. We can use the = and != operators here. You also need to be sure to use single quotes
SELECT name, website, primary_poc
FROM accounts
WHERE name = 'Exxon Mobil';
Creating a new column that is a combination of existing columns is known as a derived column. ### Order of Operations Remember BOMDAS from maths class? The same order of operations apply when using arithmetic operators in SQL.
SELECT id, account_id,
poster_amt_usd/(standard_amt_usd + gloss_amt_usd + poster_amt_usd) AS post_per
FROM orders;
LIKE - This allows you to perform operations similar to using WHERE and =, but for cases when you might not know exactly what you are looking for. The LIKE operator is frequently used with %. The % tells us that we might want any number of characters leading up to a particular set of characters or following a certain set of characters.
/* select all the companies whose names start with 'C' */
SELECT name
FROM accounts
WHERE name like 'C%'
/* select All companies whose names contain the string 'one' somewhere in the name. */
SELECT name
FROM accounts
WHERE name LIKE '%one%';
IN - This allows you to perform operations similar to using WHERE and =, but for more than one condition. We can check one, two or many column values for which we want to pull data, but all within the same query. TIP - In most SQL environments, you can use single or double quotation marks - and you may NEED to use double quotation marks if you have an apostrophe within the text you are attempting to pull.
SELECT name, primary_poc, sales_rep_id
FROM accounts
WHERE name IN ('Walmart', 'Target', 'Nordstrom');
NOT - This is used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a certain condition.
/* Use the accounts table to find the account name, primary poc, and sales rep id for all stores except Walmart, Target, and Nordstrom */
SELECT name, primary_poc, sales_rep_id
FROM accounts
WHERE name NOT IN ('Walmart', 'Target', 'Nordstrom');
AND & BETWEEN - These allow you to combine operations where all combined conditions must be true. * The AND operator is used within a WHERE statement to consider more than one logical clause at a time. Each time you link a new statement with an AND, you will need to specify the column you are interested in looking at. * Sometimes we can make a cleaner statement using BETWEEN than we can using AND. Particularly this is true when we are using the same column for different parts of our AND statement. e.g.:
/*Instead of writing : WHERE column >= 6 AND column <= 10 we can instead write*/
WHERE column BETWEEN 6 AND 10
/* Write a query that returns all the orders where the standard_qty is over 1000, the poster_qty is 0, and the gloss_qty is 0 */
SELECT *
FROM orders
WHERE standard_qty > 1000 AND poster_qty = 0 AND gloss_qty = 0;
/* Using the accounts table find all the companies whose names do not start with 'C' and end with 's' */
SELECT name
FROM accounts
WHERE name NOT LIKE 'C%' AND name LIKE '%s';
/* Use the web_events table to find all information regarding individuals who were contacted via organic or adwords and started their account at any point in 2016 sorted from newest to oldest */
SELECT *
FROM web_events
WHERE channel IN ('organic', 'adwords') AND occurred_at BETWEEN '2016-01-01' AND '2017-01-01'
ORDER BY occurred_at DESC;
/* You will notice that using BETWEEN is tricky for dates! While BETWEEN is generally inclusive of endpoints, it assumes the time is at 00:00:00 (i.e. midnight) for dates. This is the reason why we set the right-side endpoint of the period at '2017-01-01'. */
OR - This allow you to combine operations where at least one of the combined conditions must be true. Similar to the AND operator, the OR operator can combine multiple statements. Each time you link a new statement with an OR, you will need to specify the column you are interested in looking at.
/* Find list of orders ids where either gloss_qty or poster_qty is greater than 4000. Only include the id field in the resulting table. */
SELECT id
FROM orders
WHERE gloss_qty > 4000 OR poster_qty > 4000;
/* Write a query that returns a list of orders where the standard_qty is zero and either the gloss_qty or poster_qty is over 1000. */
SELECT *
FROM orders
WHERE standard_qty = 0 AND (gloss_qty > 1000 OR poster_qty > 1000);
/* Find all the company names that start with a 'C' or 'W', and the primary contact contains 'ana' or 'Ana', but it doesn't contain 'eana'*/
SELECT *
FROM accounts
WHERE (name LIKE 'C%' OR name LIKE 'W%')
AND ((primary_poc LIKE '%ana%' OR primary_poc LIKE '%Ana%')
AND primary_poc NOT LIKE '%eana%');
Only return rows that appear in both tables e.g. Using ERD: Pull standard_qty, gloss_qty, and poster_qty from the orders table, and the website and the primary_poc from the accounts table.
SELECT orders.standard_qty, orders.gloss_qty,
orders.poster_qty, accounts.website,
accounts.primary_poc
FROM orders
JOIN accounts
ON orders.account_id = accounts.id
Notice that we need to specify every table a column comes from in the SELECT statement. ### ALIAS When we JOIN tables together, it is nice to give each table an alias. Frequently an alias is just the first letter of the table name.
FROM tablename t1
JOIN tablename2 t2
SELECT a.primary_poc, w.occurred_at, w.channel, a.name
FROM web_events w
JOIN accounts a
ON w.account_id = a.id
WHERE a.name = 'Walmart';
SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
ORDER BY a.name;
SELECT r.name region, a.name account,
o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id;
If we want to include data that only exists in one of the tables we ant to join we can use an outer join If there is not matching information in the JOINed table, then you will have columns with empty cells with a data type called NULL. Writing a LEFT join statement:
Comparing LEFT and RIGHT join:
LEFT and RIGHT joins are essentially inter changeable so its standard to just use LEFT joins.
If you wanted to filter your results so that it only showed sales from a particular sales rep you could use WHERE but it’s better to use AND as part of the ON statement. A simple rule to remember this is that, when the database executes this query, it executes the join and everything in the ON clause first. Think of this as building the new result set. That result set is then filtered using the WHERE clause.
SELECT orders.*,
accounts.*
FROM orders
LEFT JOIN accounts
ON orders.account_id = accounts.id
/* Use AND in stead of WHERE for filtering this LEFT JOIN */
AND accounts.sales_rep_id = 325
SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
WHERE r.name = 'Midwest'
ORDER BY a.name;
SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
WHERE r.name = 'Midwest' AND s.name LIKE 'S%'
ORDER BY a.name;
SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
WHERE r.name = 'Midwest' AND s.name LIKE '% K%'
ORDER BY a.name;
SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
WHERE o.standard_qty > 100;
SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
WHERE o.standard_qty > 100 AND o.poster_qty > 50
ORDER BY unit_price;
SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
WHERE o.standard_qty > 100 AND o.poster_qty > 50
ORDER BY unit_price DESC;
SELECT DISTINCT a.name, w.channel
FROM accounts a
RIGHT JOIN web_events w
ON a.id = w.account_id
WHERE a.id = '1001';
SELECT o.occurred_at, a.name, o.total, o.total_amt_usd
FROM accounts a
JOIN orders o
ON o.account_id = a.id
WHERE o.occurred_at BETWEEN '01-01-2015' AND '01-01-2016'
ORDER BY o.occurred_at DESC;
When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don’t use =, because NULL isn’t considered a value in SQL.
Count the number of rows in a table
SELECT COUNT(*)
FROM accounts;
COUNT does not consider rows that have NULL values. Therefore, this can be useful for quickly identifying which rows have missing data.
Unlike COUNT, you can only use SUM on numeric columns. However, SUM will ignore NULL values and treat them as 0.
SELECT SUM(poster_qty) AS total_poster_sales
FROM orders;
SELECT SUM(standard_qty) AS total_standard_sales
FROM orders;
SELECT SUM(total_amt_usd) AS total_dollar_sales
FROM orders;
SELECT standard_amt_usd + gloss_amt_usd AS total_standard_gloss
FROM orders;
SELECT SUM(standard_amt_usd)/SUM(standard_qty) AS standard_price_per_unit
FROM orders;
MIN and MAX are aggregators that ignore NULL values. Functionally, MIN and MAX are similar to COUNT in that they can be used on non-numerical columns. Depending on the column type, MIN will return the lowest number, earliest date, or non-numerical value as early in the alphabet as possible. As you might suspect, MAX does the opposite—it returns the highest number, the latest date, or the non-numerical value closest alphabetically to “Z.”
SELECT MIN(occurred_at)
FROM orders;
Try performing the same query as in question 1 without using an aggregation function.
SELECT occurred_at
FROM orders
ORDER BY occurred_at
LIMIT 1;
SELECT MAX(occurred_at)
FROM web_events;
Try to perform the result of the previous query without using an aggregation function.
SELECT occurred_at
FROM web_events
ORDER BY occurred_at DESC
LIMIT 1;
AVG returns the mean of the data - that is the sum of all of the values in the column divided by the number of values in a column. This aggregate function ignores the NULL values in both the numerator and the denominator.
SELECT AVG(standard_qty) mean_standard, AVG(gloss_qty) mean_gloss,
AVG(poster_qty) mean_poster, AVG(standard_amt_usd) mean_standard_usd,
AVG(gloss_amt_usd) mean_gloss_usd, AVG(poster_amt_usd) mean_poster_usd
FROM orders;
SELECT *
FROM (SELECT total_amt_usd
FROM orders
ORDER BY total_amt_usd
LIMIT 3457) AS Table1
ORDER BY total_amt_usd DESC
LIMIT 2;
Since there are 6912 orders - we want the average of the 3457 and 3456 order amounts when ordered. This is the average of 2483.16 and 2482.55. This gives the median of 2482.855. This obviously isn’t an ideal way to compute. If we obtain new orders, we would have to change the limit. SQL didn’t even calculate the median for us. The above used a SUBQUERY, but you could use any method to find the two necessary values, and then you just need the average of them.
GROUP BY can be used to aggregate data within subsets of the data. For example, grouping for different accounts, different regions, or different sales representatives. Any column in the SELECT statement that is not within an aggregator must be in the GROUP BY clause. The GROUP BY always goes between WHERE and ORDER BY. ORDER BY works like SORT in spreadsheet software. SQL evaluates the aggregations before the LIMIT clause. If you don’t group by any columns, you’ll get a 1-row result—no problem there. If you group by a column with enough unique values that it exceeds the LIMIT number, the aggregates will be calculated, and then some rows will simply be omitted from the results. This is a nice way to do things because you know you’re going to get the correct aggregates. If SQL cuts the table down to 100 rows, then performed the aggregations, your results would be substantially different.
Note: You can reference the columns in your select statement in GROUP BY and ORDER BY clauses with numbers that follow the order they appear in the select statement. For example
SELECT standard_qty, COUNT(*)
FROM orders
GROUP BY 1 /*(this 1 refers to standard_qty since it is the first of the columns included in the select statement)*/
ORDER BY 1 /*(this 1 refers to standard_qty since it is the first of the columns included in the select statement)*/
SELECT a.name, o.occurred_at
FROM accounts a
JOIN orders o
ON a.id = o.account_id
ORDER BY occurred_at
LIMIT 1;
SELECT a.name, SUM(total_amt_usd) total_sales
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.name;
SELECT w.occurred_at, w.channel, a.name
FROM web_events w
JOIN accounts a
ON w.account_id = a.id
ORDER BY w.occurred_at DESC
LIMIT 1;
SELECT w.channel, COUNT(*)
FROM web_events w
GROUP BY w.channel
SELECT a.primary_poc
FROM web_events w
JOIN accounts a
ON a.id = w.account_id
ORDER BY w.occurred_at
LIMIT 1;
SELECT a.name, MIN(total_amt_usd) smallest_order
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY smallest_order;
Sort of strange we have a bunch of orders with no dollars. We might want to look into those.
SELECT r.name, COUNT(*) num_reps
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
GROUP BY r.name
ORDER BY num_reps;
You can GROUP BY multiple columns at once. This is often useful to aggregate across a number of different segments. The order of columns listed in the ORDER BY clause does make a difference. You are ordering the columns from left to right. The order of column names in your GROUP BY clause doesn’t matter—the results will be the same regardless.
SELECT a.name, AVG(o.standard_qty) avg_stand, AVG(o.gloss_qty) avg_gloss, AVG(o.poster_qty) avg_post
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
SELECT a.name, AVG(o.standard_amt_usd) avg_stand, AVG(o.gloss_amt_usd) avg_gloss, AVG(o.poster_amt_usd) avg_post
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
SELECT s.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel
ORDER BY num_events DESC;
SELECT r.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name, w.channel
ORDER BY num_events DESC;